fix: test-suite stability and drone camera pinhole#2545
Conversation
The test asserted >=8 messages after a fixed 1s sleep, but under heavy parallel load (pytest -n auto) the publisher/LCM threads get starved and deliver fewer, failing with 'assert 5 >= 8'. Poll up to a 15s deadline instead; this still verifies end-to-end delivery without assuming a wall-clock throughput.
| for unsub in unsubs: | ||
| _safe(unsub, "unsubscribe") |
There was a problem hiding this comment.
Generic unsubscribe label hides which subscription failed
Every unsubscribe failure is logged with the same step="unsubscribe" label, so when a cleanup failure is reported you cannot tell whether it was the agent_transport or the finished_transport (or any others) that raised. Since the unsubs list is an ordered collection built from append calls in the fixture, a zero-cost improvement is to include the index or the callable's repr in the step name.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Good catch — applied in the latest push (folded into the teardown commit). Each unsubscribe is now labelled by index: _safe(unsub, f"unsubscribe[{i}]"), so a failing teardown step identifies which subscription raised.
When an agent test fails mid-run (e.g. the LLM is slow under load and finished_event.wait(60) times out), a raising cleanup step skipped the rest, leaking LCM transport threads and turning the failure into a failure + teardown error. Add a shared teardown_agent_setup() that runs every step (coordinator, transports, unsubs) best-effort and logs rather than propagating. Each unsubscribe is labelled by index so a failing one is identifiable.
The camera image is logged at world/video, inside the 3D view's world origin, but no Pinhole was ever logged, so Rerun warned '2D visualizers require a pinhole ancestor to be shown in a 3D view' and the feed only rendered in the 2D pane. Register a static Pinhole at world/video built from the (now shared) camera intrinsics.
028e879 to
df228ca
Compare
|
Hi @leshy @mustafab0 @paul-nechifor @spomichter — the |
Three independent fixes surfaced while running the test suite and the demo simulation. Each is its own commit.
1.
fix(core)— flakytest_basic_deploymentThe test asserted
>= 8messages after a fixedtime.sleep(1). Under heavy parallel load (pytest -n auto) the publisher and LCM transport threads get starved and deliver fewer within that window, failing withassert 5 >= 8. It now polls up to a 15s deadline — still verifying end-to-end delivery through the deployment, without assuming a wall-clock throughput.2.
fix(tests)— resilientagent_setupteardownWhen an agent test fails mid-run (e.g. the LLM is slow under load and
finished_event.wait(60)times out), a single raising cleanup step skipped the remaining ones, leaking LCM transport threads. The autousemonitor_threadsfixture then flagged the leak, turning one failure into a failure and a teardown error. A sharedteardown_agent_setup()helper now runs every cleanup step (coordinator, transports, unsubs) best-effort and logs failures instead of propagating. Wired into bothagent_setupfixtures.3.
fix(drone)— camera pinhole for the 3D viewThe drone camera feed is logged at
world/video, inside the 3D view'sworldorigin, but noPinholewas ever logged — so Rerun warned "2D visualizers require a pinhole ancestor to be shown in a 3D view" and the feed only rendered in the dedicated 2D pane. A staticPinholeis now logged atworld/video, built from the camera intrinsics (pulled into a shared_CAMERA_INTRINSICSconstant so they can't drift from the camera module).Testing
test_basic_deploymentpasses under deliberate full-CPU starvation (previously failedassert 5 >= 8).drone-basicreplay boots cleanly; the static pinhole logs at bridge start without error.ruff checkandruff format --checkclean on all changed files.